website at “https://dataguide.nlm.nih.gov/edirect/install.html”. The following script uses
EDirect to create a metadata file. It searches the NCBI SRA database for the BioProject
“PRJEB24421”, retrieves the sample metadata, and stores it in the TSV file
“sample-metadata.tsv”.
esearch -db sra -query 'PRJEB24421[bioproject]' \
| efetch -format runinfo \
| tr ',' '\t' > sample-metadata.tsv
Then, you can edit the file as above.
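Before editing, it may help to confirm that the comma-to-tab conversion kept the columns aligned, since empty fields are easy to lose. The following is a minimal sketch, not part of EDirect or QIIME2: the helper name check_tsv_columns is hypothetical, and it assumes the first non-empty line of the file is the header.

```shell
#!/usr/bin/env bash
# check_tsv_columns (hypothetical helper): verify that every non-empty row
# of a tab-separated file has the same number of fields as the header row.
# Prints one message per mismatching row and returns non-zero if any exist.
check_tsv_columns() {
    awk -F'\t' '
        NF == 0 { next }                 # ignore blank lines
        !n      { n = NF; next }         # first non-empty line is the header
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad + 0 }
    ' "$1"
}
```

For example, running check_tsv_columns sample-metadata.tsv && echo "columns are consistent" prints the confirmation only when every row matches the header.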
7.3.3.3 Importing Microbiome Yoga Data
Our example raw data are demultiplexed reads that are not in the Casava 1.8 format. To
import the FASTQ files into a QIIME2 artifact, we need a manifest file listing the file
names and their absolute paths, as described above. We can create the manifest file manually
or with the following bash script. Before running the script, change to the “data”
directory, where the FASTQ files are found, using “cd data”.
#Creating a manifest file (requires GNU find and GNU sed)
###############################
#a- list each file name and its absolute path
find "$PWD"/*.fastq -type f -printf '%f %h/%f\n' > tmp.txt
#b- replace _1.fastq/_2.fastq in the first column with a comma
awk '{ gsub(/_[12]\.fastq/, ",", $1); print }' tmp.txt > tmp2.txt
#remove spaces
sed -r 's/\s+//g' tmp2.txt > tmp3.txt
#count the forward-read files (one per sample)
n=$(ls *_1.fastq | wc -l)
#create a direction column
seq "$n" | sed "c forward\nreverse" > tmp4.txt
#add direction column
paste tmp3.txt tmp4.txt | column -s $'\t' -t > tmp5.txt
#replace whitespace with comma
sed -e 's/\s\+/,/g' tmp5.txt > manifest.txt
#add column names
sed -i '1s/^/sample-id,absolute-filepath,direction\n/' manifest.txt
rm tmp*.txt
The “manifest.txt” file will be created in the “data” directory, and it looks as shown in
Figure 7.6.
After running the above script, you can display the file content using the text editor of
your choice. Then, move back to the project main directory using “cd ..”.
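As an alternative, the same manifest layout can be sketched as a single loop that avoids the temporary files. This is only an illustration under stated assumptions: the function name make_manifest is hypothetical, the reads are assumed to be named “<sample>_1.fastq” and “<sample>_2.fastq”, and file paths are assumed to contain no commas.

```shell
#!/usr/bin/env bash
# make_manifest (hypothetical helper): write a paired-end manifest in the
# sample-id,absolute-filepath,direction layout for the reads in a directory.
# Usage: make_manifest <fastq-directory> [output-file]
make_manifest() {
    local dir out f sample
    dir=$(cd "$1" && pwd) || return 1        # resolve to an absolute path
    out=${2:-manifest.txt}
    echo "sample-id,absolute-filepath,direction" > "$out"
    for f in "$dir"/*_1.fastq; do
        [ -e "$f" ] || continue              # the glob matched nothing
        sample=$(basename "$f" _1.fastq)     # strip directory and suffix
        if [ ! -e "$dir/${sample}_2.fastq" ]; then
            echo "warning: no reverse read for $sample, skipped" >&2
            continue
        fi
        echo "$sample,$dir/${sample}_1.fastq,forward" >> "$out"
        echo "$sample,$dir/${sample}_2.fastq,reverse" >> "$out"
    done
}
```

From the project main directory, a call such as make_manifest data data/manifest.txt would write the manifest next to the FASTQ files. Samples that lack a reverse file are reported on standard error and left out rather than written as incomplete pairs.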
The next step is to import the FASTQ files into a QIIME2 artifact. To keep files organized,
you can create a new subdirectory “inputs” for the artifact files.
mkdir inputs